memory bandwidth
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Education (0.67)
- Government > Regional Government > North America Government > United States Government (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (0.93)
- Information Technology > Artificial Intelligence > Vision (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference
Nrusimha, Aniruddha, Brandon, William, Mishra, Mayank, Shen, Yikang, Panda, Rameswar, Ragan-Kelley, Jonathan, Kim, Yoon
The size and compute characteristics of modern large language models have led to increased interest in developing specialized kernels tailored for particular training and inference workloads. Existing kernels primarily optimize for compute utilization, targeting the large-batch training and inference settings. However, low-batch inference, where memory bandwidth and kernel launch overheads are significant factors, remains important for many use cases of interest, such as edge deployment and latency-sensitive applications. This paper describes FlashFormer, which fuses the entire transformer forward pass into a single kernel to accelerate low-batch inference of large language models. Across various model sizes and quantization settings, FlashFormer achieves nontrivial speedups compared to existing inference kernels.
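To make the low-batch setting concrete, here is a back-of-envelope sketch (not from the paper; the parameter count, precision, and bandwidth figures are illustrative assumptions): at batch size 1, every decode step must stream all model weights from HBM, so memory bandwidth, not FLOPs, sets the latency floor, and per-layer kernel launch overheads become a visible fraction of each millisecond-scale step.

```python
# Rough latency floor for batch-1 decode: weight bytes / HBM bandwidth.
# All numbers below are illustrative assumptions, not measured results.

def decode_latency_floor_ms(n_params: float, bytes_per_param: float,
                            hbm_bandwidth_gb_s: float) -> float:
    """Lower bound on per-token decode latency from weight traffic alone."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / (hbm_bandwidth_gb_s * 1e9) * 1e3

if __name__ == "__main__":
    # Hypothetical 8B-parameter model in FP16 on a GPU with ~3.3 TB/s of HBM bandwidth.
    floor = decode_latency_floor_ms(8e9, 2.0, 3300)
    print(f"per-token floor: {floor:.2f} ms")  # ~4.85 ms/token
```

Fusing the whole forward pass into one kernel targets exactly this regime: the compute units are mostly idle waiting on weight traffic, so removing per-layer launches and overlapping work matters more than raising peak FLOPS.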
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
Low-Rank GEMM: Efficient Matrix Multiplication via Low-Rank Approximation with FP8 Acceleration
Large matrix multiplication is a cornerstone of modern machine learning workloads, yet traditional approaches suffer from cubic computational complexity (e.g., $\mathcal{O}(n^3)$ for a matrix of size $n\times n$). We present Low-Rank GEMM, a novel approach that leverages low-rank matrix approximations to achieve sub-quadratic complexity while maintaining hardware-accelerated performance through FP8 precision and intelligent kernel selection. On an NVIDIA RTX 4090, our implementation achieves up to 378 TFLOPS on matrices up to $N=20480$, providing 75\% memory savings and a $7.8\times$ speedup over PyTorch FP32 for large matrices. The system automatically adapts to hardware capabilities, selecting optimal decomposition methods (SVD, randomized SVD) and precision levels based on matrix characteristics and available accelerators. Comprehensive benchmarking on the NVIDIA RTX 4090 demonstrates that Low-Rank GEMM becomes the fastest approach for matrices $N\geq10240$, surpassing traditional cuBLAS implementations through memory bandwidth optimization rather than computational shortcuts.
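The core algorithmic idea can be sketched in a few lines. This is an illustrative NumPy version, not the paper's FP8 kernel; the `lowrank_gemm` helper, the rank-64 test matrix, and the exact-SVD decomposition are assumptions made for the demo (at scale a randomized SVD would be used, as the abstract notes).

```python
# Low-rank GEMM sketch: approximate A ~= U_k @ diag(s_k) @ V_k^T via truncated SVD,
# then compute A @ B as U_k @ (diag(s_k) @ (V_k^T @ B)),
# i.e. two thin GEMMs costing O(n^2 k) instead of one full O(n^3) GEMM.
import numpy as np

def lowrank_gemm(A: np.ndarray, B: np.ndarray, rank: int) -> np.ndarray:
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # exact SVD for clarity
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]        # keep the top-`rank` components
    return U @ (s[:, None] * (Vt @ B))

rng = np.random.default_rng(0)
A = rng.standard_normal((512, 64)) @ rng.standard_normal((64, 512))  # exactly rank 64
B = rng.standard_normal((512, 512))
err = np.linalg.norm(lowrank_gemm(A, B, rank=64) - A @ B) / np.linalg.norm(A @ B)
print(f"relative error at rank 64: {err:.2e}")  # near machine precision for a rank-64 A
```

The memory angle follows the same logic: storing and streaming the factors instead of the full matrix is what yields the bandwidth and capacity savings the abstract reports.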
The Role of High-Performance GPU Resources in Large Language Model Based Radiology Imaging Diagnosis
Large language models (LLMs) are rapidly being applied to radiology, enabling automated image interpretation and report generation tasks. Their deployment in clinical practice requires both high diagnostic accuracy and low inference latency, which in turn demands powerful hardware. High-performance graphics processing units (GPUs) provide the compute and memory throughput necessary to run large LLMs on imaging data. We review modern GPU architectures (e.g. NVIDIA A100/H100, AMD Instinct MI250X/MI300) and their key performance metrics: floating-point throughput, memory bandwidth, and VRAM capacity. We show how these hardware capabilities affect radiology tasks: for example, generating reports or detecting findings on CheXpert and MIMIC-CXR images is computationally intensive and benefits from GPU parallelism and tensor-core acceleration. Empirical studies indicate that using appropriate GPU resources can reduce inference time and improve throughput. We discuss practical challenges, including privacy, deployment, cost, and power, as well as optimization strategies such as mixed precision, quantization, compression, and multi-GPU scaling. Finally, we anticipate that next-generation features (8-bit tensor cores, enhanced interconnect) will further enable on-premise and federated radiology AI. Advancing GPU infrastructure is essential for safe, efficient LLM-based radiology diagnostics.
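As a concrete example of one optimization strategy the review discusses, here is a hedged PyTorch sketch of mixed-precision inference; the tiny `Sequential` model and input shape are placeholders, not a radiology pipeline or any model from the paper.

```python
# Mixed-precision inference sketch: run matmuls in half precision via autocast
# so tensor cores are used and memory traffic is roughly halved.
import torch

model = torch.nn.Sequential(                 # stand-in for an imaging/LLM backbone
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1, 4096, device=device)
model = model.to(device)

amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.inference_mode(), torch.autocast(device_type=device, dtype=amp_dtype):
    y = model(x)                             # linear layers execute in reduced precision
print(y.dtype, y.shape)
```

Quantization (e.g. 8-bit weights) pushes the same trade-off further, which is why the review treats it alongside mixed precision as a way to fit large models into limited VRAM.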
- Health & Medicine > Nuclear Medicine (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
Apple Just Upgraded the iPad Pro, MacBook Pro, and Vision Pro with Its New M5 Chip
The hardware largely remains the same, but performance gets a boost. Without much fanfare, Apple has unveiled three new flagship products today via a press release: no special event, no pre-recorded show. That might be because the new iPad Pro, MacBook Pro, and Vision Pro don't change the mold; they're identical to their predecessors. Internally, though, they debut Apple's highly anticipated M5 chip.
- Asia > Middle East > Palestine > Gaza Strip > Gaza Governorate > Gaza (0.05)
- North America > United States > California (0.05)
- Europe > Slovakia (0.05)
- Europe > Czechia (0.05)
- Information Technology (0.70)
- Transportation > Ground > Road (0.70)
- Information Technology > Communications > Mobile (1.00)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.48)
Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level
Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D), calling for the development of custom kernels, which have been limited in either functionality or performance, if not both.
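For intuition about the attention pattern being described, here is a minimal 1-D sketch with assumed window and dilation parameters; it is an illustration of the pattern, not the paper's threadblock-level kernel.

```python
# 1-D neighborhood attention sketch: each query attends only to keys within
# `window` steps of itself, sampled every `dilation` positions.
import torch

def neighborhood_attention_1d(q, k, v, window: int = 3, dilation: int = 1):
    # q, k, v: (seq, dim). Build a banded/dilated mask over pairwise positions.
    n, d = q.shape
    idx = torch.arange(n)
    offset = idx[:, None] - idx[None, :]                      # signed distance between positions
    mask = (offset.abs() <= window * dilation) & (offset % dilation == 0)
    scores = (q @ k.T) / d ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(16, 8)
out = neighborhood_attention_1d(q, k, v, window=2, dilation=2)
print(out.shape)  # torch.Size([16, 8])
```

Note that this sketch still materializes the full n-by-n score matrix and then masks it away, which is exactly the inefficiency that motivates custom kernels for sliding-window patterns, especially in 2-D and 3-D.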
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Education (0.67)
- Government > Regional Government > North America Government > United States Government (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)
- Information Technology > Artificial Intelligence > Vision (0.68)
SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference
Zhang, Hengrui, Patel, Pratyush, Ning, August, Wentzlaff, David
Large Language Models (LLMs) have gained popularity in recent years, driving up the demand for inference. LLM inference is composed of two phases with distinct characteristics: a compute-bound prefill phase followed by a memory-bound decode phase. To efficiently serve LLMs, prior work proposes prefill-decode disaggregation to run each phase on separate hardware. However, existing hardware poorly matches the different requirements of each phase. Current datacenter GPUs and TPUs follow a more-is-better design philosophy that maximizes compute and memory resources, causing memory bandwidth underutilization in the prefill phase and compute underutilization in the decode phase. Such underutilization directly translates into increased serving costs. This paper proposes SPAD (Specialized Prefill and Decode hardware), adopting a less-is-more methodology to design specialized chips tailored to the distinct characteristics of prefill and decode phases. The proposed Prefill Chips have larger systolic arrays and use cost-effective GDDR memory, whereas the proposed Decode Chips retain high memory bandwidth but reduce compute capacity. Compared to modeled H100s, simulations show that the proposed Prefill Chips deliver 8% higher prefill performance on average at 52% lower hardware cost, while the proposed Decode Chips achieve 97% of the decode performance with 28% lower TDP. End-to-end simulations on production traces show that SPAD reduces hardware cost by 19%-41% and TDP by 2%-17% compared to modeled baseline clusters while offering the same performance. Even when models and workloads change, SPAD can reallocate either type of chip to run either phase and still achieve 11%-43% lower hardware costs, demonstrating the longevity of the SPAD design.
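The compute-bound versus memory-bound split can be made concrete with a short arithmetic-intensity estimate. The figures below are illustrative assumptions, not the paper's model: the "~300 FLOPs/byte" balance point is an assumed H100-class ratio of FP16 throughput to HBM bandwidth.

```python
# Arithmetic intensity (FLOPs per byte of weight traffic) for prefill vs. decode.
# Assumed numbers for illustration only.

def arithmetic_intensity(tokens_per_pass: int, n_params: float, bytes_per_param: float) -> float:
    flops = 2 * n_params * tokens_per_pass      # ~2 FLOPs per parameter per token
    bytes_moved = n_params * bytes_per_param    # weights streamed once per forward pass
    return flops / bytes_moved

prefill = arithmetic_intensity(tokens_per_pass=4096, n_params=70e9, bytes_per_param=2)
decode = arithmetic_intensity(tokens_per_pass=1, n_params=70e9, bytes_per_param=2)
print(f"prefill ~{prefill:.0f} FLOPs/byte, decode ~{decode:.0f} FLOPs/byte")
# A GPU that balances at roughly 300 FLOPs/byte saturates compute during prefill (~4096)
# but leaves most of it idle during decode (~1), motivating separate chip designs.
```

This is the asymmetry SPAD exploits: prefill chips can trade expensive HBM for cheaper GDDR and more compute, while decode chips keep the bandwidth and shed compute.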
- North America > United States > New York > New York County > New York City (0.14)
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- North America > United States > Florida > Orange County > Orlando (0.04)
- (14 more...)
RecIS: Sparse to Dense, A Unified Training Framework for Recommendation Models
Zong, Hua, Zeng, Qingtao, Zhou, Zhengxiong, Han, Zhihua, Yan, Zhensong, Liu, Mingjie, Sun, Hechen, Liu, Jiawei, Hu, Yiwen, Wang, Qi, Xian, YiHan, Guo, Wenjie, Xiang, Houyuan, Zeng, Zhiyuan, Sheng, Xiangrong, Yan, Bencheng, Hu, Nan, Huang, Yuheng, Lian, Jinqing, Xu, Ziru, Zhang, Yan, Huang, Ju, Yang, Siran, Yi, Huimin, Wang, Jiamang, Wang, Pengjie, Zhu, Han, Wu, Jian, Ou, Dan, Xu, Jian, Tang, Haihong, Jiang, Yuning, Zheng, Bo, Qu, Lin
In this paper, we propose RecIS, a unified sparse-dense training framework designed to achieve two primary goals. 1. Unified Framework: provide a unified sparse-dense training framework, built on the PyTorch ecosystem, that meets the training needs of industrial-grade recommendation models integrated with large models. 2. System Optimization: optimize the sparse component to offer superior efficiency over TensorFlow-based recommendation models, while the dense component leverages existing optimization technologies within the PyTorch ecosystem. Currently, RecIS is used at Alibaba for numerous large-model-enhanced recommendation training tasks, and some traditional sparse models have also begun training on it.
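For readers unfamiliar with the terminology, here is a minimal PyTorch sketch of what a "sparse-dense" recommendation model looks like: a sparse part of large embedding tables indexed by categorical IDs feeding a dense MLP. The `SparseDenseRecModel` class and its sizes are assumptions for illustration, not the RecIS API.

```python
# Sparse-dense recommendation model sketch: sparse embedding lookups + dense MLP.
import torch
import torch.nn as nn

class SparseDenseRecModel(nn.Module):
    def __init__(self, n_ids: int = 100_000, emb_dim: int = 16, n_dense_feats: int = 8):
        super().__init__()
        # Sparse component: huge table, sparse gradients, bandwidth-bound lookups.
        self.embedding = nn.EmbeddingBag(n_ids, emb_dim, mode="sum", sparse=True)
        # Dense component: ordinary compute-bound layers handled by the PyTorch stack.
        self.mlp = nn.Sequential(nn.Linear(emb_dim + n_dense_feats, 64),
                                 nn.ReLU(), nn.Linear(64, 1))

    def forward(self, ids, offsets, dense_feats):
        sparse_out = self.embedding(ids, offsets)
        return self.mlp(torch.cat([sparse_out, dense_feats], dim=-1)).squeeze(-1)

model = SparseDenseRecModel()
ids = torch.tensor([3, 17, 42, 7])     # variable-length bags of categorical feature IDs
offsets = torch.tensor([0, 3])         # two samples: ids[0:3] and ids[3:]
score = model(ids, offsets, torch.randn(2, 8))
print(score.shape)                     # torch.Size([2])
```

The framework's stated contribution is making both halves efficient in one PyTorch-based system, rather than splitting them across TensorFlow-era sparse stacks and separate dense training code.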
- Information Technology (0.69)
- Education (0.48)
TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving
Wu, Bingyang, Zhang, Zili, Zhong, Yinmin, Huang, Guanzhe, Zhu, Yibo, Liu, Xuanzhe, Jin, Xin
Prefix caching is crucial to accelerate multi-turn interactions and requests with shared prefixes. At the cluster level, existing prefix caching systems are tightly coupled with request scheduling to optimize cache efficiency and computation performance together, leading to load imbalance, data redundancy, and memory fragmentation across instances. Memory pooling is a promising way to shield the scheduler from the underlying cache management so that it can focus on computation optimization. However, because existing prefix caching systems only transfer increasingly longer prefix caches between instances, they cannot achieve low-latency memory pooling. To address these problems, we propose TokenLake, a unified segment-level prefix cache pool. It uses a declarative cache interface to expose requests' query tensors, prefix caches, and cache-aware operations to TokenLake for efficient pooling. Powered by this abstraction, TokenLake can manage prefix caches at the segment level with a heavy-hitter-aware load balancing algorithm to achieve better cache load balance, deduplication, and defragmentation. TokenLake also transparently minimizes the communication volume of query tensors and new caches. Based on TokenLake, the scheduler can schedule requests elastically using existing techniques, without considering prefix cache management. Evaluations on real-world workloads show that TokenLake improves throughput by up to 2.6$\times$ and 2.0$\times$ and boosts hit rate by 2.0$\times$ and 2.1$\times$, compared to state-of-the-art cache-aware routing and cache-centric PD-disaggregation solutions, respectively.
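As a toy illustration of the general idea of segment-level caching with deduplication, here is a small content-addressed pool; the `SegmentPool` class, the segment size, and the string KV placeholders are assumptions for the sketch, not TokenLake's declarative interface.

```python
# Segment-level prefix cache sketch: split prefixes into fixed-size token segments
# and store them content-addressed, so identical segments shared across requests
# are kept only once in the pool.
from hashlib import blake2b

SEGMENT = 4  # tokens per segment (illustrative)

class SegmentPool:
    def __init__(self):
        self.segments = {}          # segment hash -> cached KV payload (placeholder)
        self.hits = self.misses = 0

    def lookup_or_insert(self, tokens):
        keys = []
        for i in range(0, len(tokens) - len(tokens) % SEGMENT, SEGMENT):
            seg = tuple(tokens[i:i + SEGMENT])
            h = blake2b(repr(seg).encode(), digest_size=8).hexdigest()
            if h in self.segments:
                self.hits += 1
            else:
                self.misses += 1
                self.segments[h] = f"kv-for-{seg}"   # stand-in for real KV tensors
            keys.append(h)
        return keys

pool = SegmentPool()
pool.lookup_or_insert([1, 2, 3, 4, 5, 6, 7, 8, 9])        # two full segments cached
pool.lookup_or_insert([1, 2, 3, 4, 5, 6, 7, 8, 42, 43])   # shared prefix segments deduplicated
print(pool.hits, pool.misses, len(pool.segments))          # 2 hits, 2 misses, 2 unique segments
```

Fixed-size segments are also what make pooling cheap to rebalance: a load balancer can move or replicate hot segments between instances without shipping ever-longer whole prefixes.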
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)